Welcome to the second part of the tutorial, where we are going to have a look at another popular social media platform: YouTube.
library(knitr)
#library(magick)
library(png)
library(tuber)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## Warning: package 'dplyr' was built under R version 3.5.1
## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(tidytext)
library(grid)
#library(emo)
library(icon)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:icon':
##
## fa
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
options(stringsAsFactors = FALSE)
tuber
Once you set up your YouTube access via the Google Developer Console, you can connect to the API and download data.
A reminder: to connect you use yt_oauth().
app_id = "667235664106-a5e6ng7pna0ptv7qoqnrsn0ldm1i68a8.apps.googleusercontent.com"
app_secret = "8UV8kvt8cZ75EXg-ubOPBjDJ"
#connect
yt_oauth(app_id=app_id, app_secret=app_secret, token='', cache=FALSE)
By default the function looks for .httr-oauth in the working directory in case you connected before. If it doesn’t find it, it passes an application ID and a secret. If you do not want the function to use cache, set cache=FALSE.
The function launches a browser to allow you to authorize the application.
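Hard-coding the ID and secret in a script makes them easy to leak. Here is a minimal sketch of one alternative, assuming you keep them in environment variables. The variable names YT_APP_ID and YT_APP_SECRET and the helper read_credential are made up for this example and are not part of tuber.

```r
# Hypothetical helper: read a credential from an environment variable
# and fail early with a clear message if it is missing.
read_credential <- function(var_name) {
  value <- Sys.getenv(var_name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable ", var_name, " is not set", call. = FALSE)
  }
  value
}

# For demonstration only: set placeholder values in this session.
# In practice you would define them outside the script, e.g. in ~/.Renviron.
Sys.setenv(YT_APP_ID = "my-app-id", YT_APP_SECRET = "my-app-secret")

app_id     <- read_credential("YT_APP_ID")
app_secret <- read_credential("YT_APP_SECRET")

# The values can then be passed on as before:
# yt_oauth(app_id = app_id, app_secret = app_secret)
```

The helper stops with an error rather than returning an empty string, so a missing credential shows up before any request is made.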
Once you are connected, we can start searching!
results <- yt_search("World Cup 2018")
kable(results[1:4,1:5])
| video_id | publishedAt | channelId | title | description |
|---|---|---|---|---|
| qlZaiBuOaz4 | 2017-09-09T19:17:50.000Z | UC07CHp5ikd-AyafR4J2LWkA | No T20 World Cup In 2018, Australia Will host Next WT20 in 2020 | It is now officially confirmed by ICC that the 2018 season of ICC WT20 Will not gonna happen,Australia now awarded as the next destination of icc t20 world cup … |
| sJSH7SM1La0 | 2018-07-06T18:00:03.000Z | UCH0lge-0lFszfPRqFkaoXpw | Brasile Vs Belgio “Calci di Rigore” World CUP 2018 Quarti di Finale | PES 2018 Patch [Giù] | PES 2018 - Stadium Pack v2 All AIO Download … |
| Ff554RwjHi0 | 2018-01-21T11:03:01.000Z | UCoNOFr0J8S2yMmX3lB2gs8g | ICC World T20 2018 Schedule, Time Table, Venue, Host Country, Fixtures, News | ICC World T20 2018 Schedule, All Teams, Time Table, Venue, Host Country, Groups, Fixtures, Important News, Theme and World T20 2018 logo. |
| aWxbphbmDts | 2018-01-22T15:48:13.000Z | UCqMp3kP5EP0YXANgY-I43JA | T20 World Cup 2018 Full Schedule And Time Table ICC World 20 20 Cup | ICC T20 World Cup 2018 Full Schedule And Time Table in urdu hindi ICC t20 World Cup 2018 ICC t20 World Cup full time table Hind Pakistan Vs India T20 … |
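The publishedAt values in the table above are text strings in ISO 8601 format. If you want to sort or filter by date, they need to be parsed first. A small sketch in base R, using the four timestamps from the table (the variable names are just for illustration):

```r
# Timestamps copied from the publishedAt column of the table above
published_at <- c("2017-09-09T19:17:50.000Z", "2018-07-06T18:00:03.000Z",
                  "2018-01-21T11:03:01.000Z", "2018-01-22T15:48:13.000Z")

# Parse as UTC; trailing characters not covered by the format (".000Z") are ignored
published <- as.POSIXct(published_at, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

# Keep only the calendar date, e.g. for grouping by day or month
published_date <- as.Date(published)

# Position of the most recently published video in this sample
newest <- which.max(published)
```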
The yt_search function returns a data.frame with 16 columns, including:
* video_id: id of the video that can be used later to get its stats
* publishedAt: date of the publication
* channelId: id of the channel that can be used later to get access to the channel
* title: title of the video
* description: a short description of the video
* channelTitle: a more “user-friendly” name of the channel

The yt_search function takes parameters that let you specify your search. The most popular ones are:
* term: a search term to be used; searches across different types, including video (the default one), channel and playlist
* max_results: maximum number of items to be returned, with the default set to 50 and the maximum allowed = 500
* channel_id: searches videos on the channel; requires a channel id
* type: specifies the type of the media and can take one of three values: video (default), channel, playlist
* published_after and published_before: specify the timeframe for the search

Now let’s have a look at the comments on a video. We can pick any video in our search and use its video_id to get its comments:
results <- get_comment_threads(c(video_id="qlZaiBuOaz4"))
kable(results[1:10, c("authorDisplayName", "textDisplay")])
| authorDisplayName | textDisplay |
|---|---|
| Suraj Yadav | Sale jhooth bolta hai faltu channal ko subscribe nhi kiya jata |
| Meena Srivastava | Nice video bro |
| bdayal singh | My favorite tournament is T20 world cup |
| DHRUV SAWANT | 2020 me hoga |
| Nikhil Yadav | GU hai TU guuu |
| Vandana2 Rajpurohit | yash bhai app tho ipl ke hi video banate ho aap world cricket ke upar bhi video banao |
| Arjun Bhandari | Ipl k bareme batao jiii rcb k bareme kya change hoga |
| Vikas Mathur | Bhosdee ke tuje bada pata h |
| Munmun Samanta | chip Kar faltu |
| Mohammed Rafique | Asia cup kab hoga bhai |
You specify your video id in the filter argument. You can also use this argument to download comments for a specific channel. So, let’s have a look at your favourite YouTube channel. Which one is your favourite? Mine is Bloomberg.
But… I guess… it’s getting too busy for the day, so let’s take a break and have a look at Victoria’s Secret.
It’s Brisbane, Australia after all!
To locate the channel we need to use its channel id, not its name.
Obtaining data from a YouTube channel, such as its stats, is usually done through channel_id, which is not the same as the YouTube name you see in the YouTube link.
There are several ways to obtain the channel_id from the YouTube name:
* Request https://www.googleapis.com/youtube/v3/channels?key={YOUR_API_KEY}&forUsername={USER_NAME}&part=id where YOUR_API_KEY is the key you create in the Google developer account and USER_NAME is the YouTube channel (username).
* Search the source of the channel page for externalId or data-channel-external-id. The value there will be the channel id.

Now that we have the channel id, let’s get a list of its videos and have a look at its stats.
channel_id<-"UChWXY0e-HUhoXZZ_2GlvojQ"
videosVS = yt_search(term="", type="video", channel_id = channel_id)
kable(videosVS[1:4,1:5])
| video_id | publishedAt | channelId | title | description |
|---|---|---|---|---|
| HLo_3GfHjps | 2012-11-30T17:10:00.000Z | UChWXY0e-HUhoXZZ_2GlvojQ | Miranda Kerr and Bruno Mars Backstage at the 2012 Victoria’s Secret Fashion Show | Get ready for dimples, winks and giggles when Miranda Kerr quizzes Bruno Mars backstage at the 2012 Victoria’s Secret Fashion Show. Catch them both on the … |
| 76gf3YeGf1s | 2010-11-30T17:49:42.000Z | UChWXY0e-HUhoXZZ_2GlvojQ | Before I Was a Supermodel: Behati Prinsloo | On the eve of the 2010 Fashion Show, Supermodel Behati Prinsloo shares some memories about growing up and becoming a model. |
| wMYYQkwLDqM | 2013-10-17T15:27:54.000Z | UChWXY0e-HUhoXZZ_2GlvojQ | Candice Swanepoel Meets the Royal Fantasy Bra | Go on set as Victoria’s Secret Angel Candice Swanepoel gets to see and try on the Royal Fantasy Bra for the very first time. Look for Candice and the $10 million … |
| DkcctlhyWEs | 2015-08-22T12:55:46.000Z | UChWXY0e-HUhoXZZ_2GlvojQ | Romee Strijd on Becoming a Victoria’s Secret Angel | Dutch supermodel Romee Strijd talks about how she was discovered, being cast for the Victoria’s Secret Fashion Show and her journey to becoming an Angel. |
#get channel stats
statsVS<-get_channel_stats(channel_id=channel_id)
## Channel Title: Victoria's Secret
## No. of Views: 265955120
## No. of Subscribers: 1504923
## No. of Videos: 743
statsVSSelected <- as.vector(statsVS$statistics)
results<-do.call(rbind, statsVSSelected)
head(results)
## [,1]
## viewCount "265955120"
## commentCount "0"
## subscriberCount "1504923"
## hiddenSubscriberCount "FALSE"
## videoCount "743"
The get_channel_stats function is quite straightforward. It takes channel_id as an argument and returns a nested list. We can select the items we need from the list and convert them to a data.frame.
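To make the "nested list to data.frame" step concrete, here is a sketch that does not need API access. The list stats_mock is a hand-made stand-in that mirrors only the statistics element printed above; a real response contains more elements, and the column viewsPerVideo is added purely as an illustration.

```r
# Mock of the nested list: only the `statistics` element, with the values printed above
stats_mock <- list(
  statistics = list(
    viewCount = "265955120",
    commentCount = "0",
    subscriberCount = "1504923",
    hiddenSubscriberCount = FALSE,
    videoCount = "743"
  )
)

# One row, one column per statistic
stats_df <- as.data.frame(stats_mock$statistics, stringsAsFactors = FALSE)

# The counts arrive as text, so convert them to numbers
count_cols <- c("viewCount", "commentCount", "subscriberCount", "videoCount")
stats_df[count_cols] <- lapply(stats_df[count_cols], as.numeric)

# With numeric columns, derived measures are one line away
stats_df$viewsPerVideo <- stats_df$viewCount / stats_df$videoCount
```

A one-row data.frame per channel is handy once you collect several channels, because the rows can be stacked with rbind.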
Now that we have a list of videos from the VS channel, let’s download stats for each video. As an example, let’s do the first 10.
videosVS_sample<-videosVS[1:10,]
videoStatsVS = lapply(as.character(videosVS_sample$video_id), function(x){
get_stats(video_id = x)
})
videoStatsVS_df = do.call(rbind.data.frame, videoStatsVS)
head(videoStatsVS_df)
## id viewCount likeCount dislikeCount favoriteCount commentCount
## 2 HLo_3GfHjps 2260468 16320 277 0 1826
## 21 76gf3YeGf1s 885113 5463 72 0 299
## 3 wMYYQkwLDqM 2490444 17828 299 0 1305
## 4 DkcctlhyWEs 954050 9356 202 0 402
## 5 anhEcKWRzyE 2167444 37678 569 0 1674
## 6 LwelJNogl-M 883141 7715 88 0 200
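When you loop over hundreds of video ids, a single failed request can abort the whole lapply. The sketch below shows the same "apply, then bind rows" pattern with a guard around each call. The function fake_get_stats is a hypothetical stand-in for get_stats so that the example runs without API access, and "not-a-real-id" is a deliberately invalid id.

```r
# Hypothetical stand-in for get_stats(): returns a list of counts for a known id,
# or raises an error for an unknown one (as a failed request might).
fake_get_stats <- function(video_id) {
  known <- list(
    "HLo_3GfHjps" = list(id = "HLo_3GfHjps", viewCount = "2260468", likeCount = "16320"),
    "76gf3YeGf1s" = list(id = "76gf3YeGf1s", viewCount = "885113", likeCount = "5463")
  )
  if (!video_id %in% names(known)) stop("no stats for ", video_id)
  known[[video_id]]
}

ids <- c("HLo_3GfHjps", "not-a-real-id", "76gf3YeGf1s")

# Wrap each call so that one failure does not abort the whole loop
stats_list <- lapply(ids, function(x) {
  tryCatch(fake_get_stats(x), error = function(e) NULL)
})

# Drop the failed calls, turn each result into a one-row data.frame, then stack them
stats_list <- Filter(Negate(is.null), stats_list)
stats_df <- do.call(rbind, lapply(stats_list, as.data.frame, stringsAsFactors = FALSE))
```

Converting each result to a data.frame before binding keeps the column names and gives plain sequential row names.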
The function uses video ids but does not return video titles and dates, which we can add ourselves before doing some clean-up.
videoStatsVS_df$title = videosVS_sample$title
videoStatsVS_df$date = videosVS_sample$date
library(tidyverse)
videoStatsVS_df = as_tibble(videoStatsVS_df) %>%
  mutate(viewCount = as.numeric(as.character(viewCount)), #originally as factor
         likeCount = as.numeric(as.character(likeCount)),
         dislikeCount = as.numeric(as.character(dislikeCount)),
         commentCount = as.numeric(as.character(commentCount)))
head(videoStatsVS_df)
## # A tibble: 6 x 7
## id viewCount likeCount dislikeCount favoriteCount commentCount title
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 HLo_3… 2260468 16320 277 0 1826 Mira…
## 2 76gf3… 885113 5463 72 0 299 Befo…
## 3 wMYYQ… 2490444 17828 299 0 1305 Cand…
## 4 Dkcct… 954050 9356 202 0 402 Rome…
## 5 anhEc… 2167444 37678 569 0 1674 Road…
## 6 LwelJ… 883141 7715 88 0 200 Vict…
I ran the function with the full list of video ids for the channel, and you can use the resulting file, videoStatsVS_df.csv, under the YouTube tutorial-data folder. Let’s load it:
videoStatsVS_df <- as.data.table(read.csv("YouTube tutorial-data/videoStatsVS_df.csv", stringsAsFactors=FALSE))
head(videoStatsVS_df)
## X id viewCount likeCount dislikeCount favoriteCount
## 1: 1 HLo_3GfHjps 2259491 16317 277 0
## 2: 2 76gf3YeGf1s 884766 5463 72 0
## 3: 3 wMYYQkwLDqM 2490140 17827 299 0
## 4: 4 DkcctlhyWEs 953821 9353 202 0
## 5: 5 anhEcKWRzyE 2165304 37656 569 0
## 6: 6 LwelJNogl-M 883017 7711 88 0
## commentCount
## 1: 1826
## 2: 299
## 3: 1305
## 4: 402
## 5: 1673
## 6: 200
## title
## 1: Miranda Kerr and Bruno Mars Backstage at the 2012 Victoria's Secret Fashion Show
## 2: Before I Was a Supermodel: Behati Prinsloo
## 3: Candice Swanepoel Meets the Royal Fantasy Bra
## 4: Romee Strijd on Becoming a Victoria’s Secret Angel
## 5: Road to the Runway: Episode 1 – Castings
## 6: Victoria’s Secret Angel Outtakes 2014
Let’s see which videos were most popular.
Sorted by view count:
videoStatsVS_df %>% arrange(desc(viewCount)) %>%
  # top_n() without wt selects rows by the last column of the table; use wt = viewCount to select by views
  top_n(n = 5) %>%
  select(title, viewCount, likeCount, favoriteCount, commentCount, id)
## title viewCount likeCount favoriteCount
## 1 Why Alessandra Ambrosio loves her body 773575 3250 0
## 2 You've Got Male 422072 881 0
## 3 Wild & Beautiful: Behind the Scenes 338447 1115 0
## 4 Why Lindsay Ellingson loves her body 178581 320 0
## 5 Why Emanuela DePaula loves her body 118804 339 0
## commentCount id
## 1 23 EMOxRWd-Q1k
## 2 19 drxva8UM-zM
## 3 17 nmTT42Q0bsI
## 4 17 B3QSmqcTojQ
## 5 11 jANpsZGKPXA
Let’s have a look at it!
Sorted by likes:
videoStatsVS_df %>% arrange(desc(likeCount)) %>%
  # top_n() without wt selects rows by the last column of the table; use wt = likeCount to select by likes
  top_n(n = 5) %>%
  select(title, likeCount, viewCount, favoriteCount, commentCount, id)
## title likeCount viewCount favoriteCount
## 1 Why Alessandra Ambrosio loves her body 3250 773575 0
## 2 Wild & Beautiful: Behind the Scenes 1115 338447 0
## 3 You've Got Male 881 422072 0
## 4 Why Emanuela DePaula loves her body 339 118804 0
## 5 Why Lindsay Ellingson loves her body 320 178581 0
## commentCount id
## 1 23 EMOxRWd-Q1k
## 2 17 nmTT42Q0bsI
## 3 19 drxva8UM-zM
## 4 11 jANpsZGKPXA
## 5 17 B3QSmqcTojQ
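Sorted by comments:
videoStatsVS_df %>% arrange(desc(commentCount)) %>%
  # top_n() without wt selects rows by the last column of the table; use wt = commentCount to select by comments
  top_n(n = 5) %>%
  select(title, commentCount, viewCount, likeCount, favoriteCount, id)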
## title commentCount viewCount likeCount
## 1 Why Alessandra Ambrosio loves her body 23 773575 3250
## 2 You've Got Male 19 422072 881
## 3 Wild & Beautiful: Behind the Scenes 17 338447 1115
## 4 Why Lindsay Ellingson loves her body 17 178581 320
## 5 Why Emanuela DePaula loves her body 11 118804 339
## favoriteCount id
## 1 0 EMOxRWd-Q1k
## 2 0 drxva8UM-zM
## 3 0 nmTT42Q0bsI
## 4 0 B3QSmqcTojQ
## 5 0 jANpsZGKPXA
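A ranking is easiest to trust when the ranking column is named explicitly. Below is a base R sketch of a "top n by column" helper; top_by is a made-up name, and the six rows are copied from the head() of videoStatsVS_df shown earlier, so the result covers only that sample. With dplyr, the closest equivalent is top_n(5, wt = viewCount).

```r
# Six rows copied from the head() of videoStatsVS_df shown earlier
videos <- data.frame(
  id = c("HLo_3GfHjps", "76gf3YeGf1s", "wMYYQkwLDqM",
         "DkcctlhyWEs", "anhEcKWRzyE", "LwelJNogl-M"),
  viewCount = c(2259491, 884766, 2490140, 953821, 2165304, 883017),
  likeCount = c(16317, 5463, 17827, 9353, 37656, 7711),
  commentCount = c(1826, 299, 1305, 402, 1673, 200),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the n rows with the largest values in `column`
top_by <- function(df, column, n = 5) {
  ranked <- df[order(df[[column]], decreasing = TRUE), , drop = FALSE]
  head(ranked, n)
}

top_views    <- top_by(videos, "viewCount", n = 3)
top_likes    <- top_by(videos, "likeCount", n = 3)
top_comments <- top_by(videos, "commentCount", n = 3)
```

Within this sample the three rankings contain the same three videos but in a different order, which is the kind of difference the three tables above are meant to reveal.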
Let’s have a look at titles now and see what we can find there.
Continuing with the “girl power” theme, let’s compare VS to another powerhouse, US Vogue.
Following the procedure described earlier, I downloaded video stats for both the VS and US Vogue channels and merged them into one file, videostats_All.csv.
Further analysis will include text manipulation, so we will need the tidyverse and tidytext packages.
library(tidyverse)
library(tidytext)
You can either download the channel stats yourself (see above) or use videostats_All.csv.
Channel ids are:
* Americanvogue = UCRXiA3h1no_PFkb1JCP0yMA
* VICTORIASSECRET = UChWXY0e-HUhoXZZ_2GlvojQ

To load the existing file, let’s do this:
videostats_All <- as.data.table(read.csv("YouTube tutorial-data/videostats_All.csv", stringsAsFactors=FALSE))
head(videostats_All)
## date
## 1: 2016-01-05
## 2: 2016-01-18
## 3: 2016-01-19
## 4: 2016-01-22
## 5: 2016-02-02
## 6: 2016-02-11
## title
## 1: Inside the Brooklyn Home of Artist Mickalene Thomas
## 2: Watch Irene Kim’s Japanese Hot Springs Adventure
## 3: 73 Questions With Derek Zoolander
## 4: Watch Pop Star in the Making Zella Day Perform a Heartbreaking Ballad
## 5: Cleaning House With Organizing Guru Marie Kondo
## 6: Here’s What It’s Like to Be Lucky Blue Smith During Fashion Week
## video_id viewCount likeCount dislikeCount commentCount source
## 1: SMX60fh5u7o 29346 1304 7 27 Vogue
## 2: r-shJpCflvQ 59103 1206 29 38 Vogue
## 3: H4q0K561WXs 2395735 34455 1817 1754 Vogue
## 4: wg19p1cDSRE 58920 2104 9 101 Vogue
## 5: z3OXvQZe7g8 392529 4121 870 171 Vogue
## 6: LIKTr9vXRQI 141677 2853 52 131 Vogue
Let’s have a brief look at the data. We are going to use the stargazer package, which is fantastic for generating “academic”-looking results, and the describeBy function from the psych package, which generates statistics by a grouping variable. We will group the variables by channel.
library(stargazer)
# digits is left at stargazer's default of 3; pass digits = 1 for one decimal place
stargazer(videostats_All[,.(viewCount, likeCount, commentCount, dislikeCount)], median=TRUE, type = "text")
##
## ============================================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Median Pctl(75) Max
## --------------------------------------------------------------------------------------------
## viewCount 456 1,199,256.000 2,452,764.000 6,477 115,134 367,336.5 1,082,360.0 20,665,383
## likeCount 456 25,219.360 66,559.910 145 1,992 5,040.5 18,092 747,921
## commentCount 455 1,150.145 3,401.707 4.000 85.500 202.000 707.500 31,101.000
## dislikeCount 456 727.618 2,297.890 1 33 88.5 425 29,639
## --------------------------------------------------------------------------------------------
##
library(psych)
results<-describeBy(videostats_All[, .(viewCount, likeCount, commentCount, dislikeCount)],
group=videostats_All$source, digits=1, mat=TRUE)
results[,c(1:7, 10:11)]
## item group1 vars n mean sd median min
## viewCount1 1 Vogue 1 307 1512729.9 2874266.3 478164 6477
## viewCount2 2 VS 1 149 553373.0 889076.4 208598 31747
## likeCount1 3 Vogue 2 307 35448.6 79096.3 10367 145
## likeCount2 4 VS 2 149 4143.1 4528.7 2468 722
## commentCount1 5 Vogue 3 306 1611.7 4067.6 359 4
## commentCount2 6 VS 3 149 202.3 235.9 111 17
## dislikeCount1 7 Vogue 4 307 1042.1 2745.7 211 1
## dislikeCount2 8 VS 4 149 79.7 136.8 43 5
## max
## viewCount1 20665383
## viewCount2 5348750
## likeCount1 747921
## likeCount2 37661
## commentCount1 31101
## commentCount2 1673
## dislikeCount1 29639
## dislikeCount2 1189
Just a reminder that:
* mean: the average
* st. dev: a measure of variation in the data around the average
* min and max: the extreme values
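To see how these measures behave, here is a tiny base R sketch on made-up numbers (the data frame toy and its values are invented for illustration and have nothing to do with the real channels). Channel A has one very popular video, channel B has three similar ones.

```r
# Made-up view counts for two channels, for illustration only
toy <- data.frame(
  source = c("A", "A", "A", "B", "B", "B"),
  viewCount = c(100, 200, 900, 50, 60, 70)
)

# One statistic at a time, split by channel
means <- tapply(toy$viewCount, toy$source, mean)
sds   <- tapply(toy$viewCount, toy$source, sd)
mins  <- tapply(toy$viewCount, toy$source, min)
maxs  <- tapply(toy$viewCount, toy$source, max)

# Collect everything in one table, one row per channel
summary_by_source <- data.frame(
  source = names(means),
  mean = as.vector(means),
  sd = as.vector(sds),
  min = as.vector(mins),
  max = as.vector(maxs)
)
```

A single outlier pulls the mean of channel A far above most of its videos and inflates its standard deviation, which is why the median is worth reporting alongside the mean for view counts.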
It is likely that the number of views relates to the number of likes: the more people view a video, the more “likes” it collects. We can check this with a correlation, using the corr.test function from the same psych package.
results<-corr.test(videostats_All[, .(viewCount, likeCount, commentCount, dislikeCount)], use = "complete",method="pearson",adjust="holm",
alpha=.05,ci=FALSE)
results$r
## viewCount likeCount commentCount dislikeCount
## viewCount 1.0000000 0.8130140 0.8078345 0.8286073
## likeCount 0.8130140 1.0000000 0.9469651 0.6756837
## commentCount 0.8078345 0.9469651 1.0000000 0.7457451
## dislikeCount 0.8286073 0.6756837 0.7457451 1.0000000
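If you do not have psych installed, base R gives you the same kind of matrix with cor(). The sketch below uses only the six VS rows printed earlier, so its numbers will differ from the full-data matrix above. The Spearman version is added as an assumption about what may be useful here, since view counts are heavily skewed.

```r
# Six rows copied from the head() of videoStatsVS_df shown earlier
counts <- data.frame(
  viewCount = c(2259491, 884766, 2490140, 953821, 2165304, 883017),
  likeCount = c(16317, 5463, 17827, 9353, 37656, 7711),
  commentCount = c(1826, 299, 1305, 402, 1673, 200)
)

# Pearson correlation matrix, the same kind of object as results$r
r_pearson <- cor(counts, method = "pearson")

# Rank-based alternative, less sensitive to a few very large values
r_spearman <- cor(counts, method = "spearman")

# Test for a single pair of variables, including a p-value
test_views_likes <- cor.test(counts$viewCount, counts$likeCount, method = "pearson")
```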
Or we can plot it with the help of ggplot2 and gridExtra.
library (ggplot2)
library(gridExtra)
p1=ggplot(data = videostats_All[-1, ]) + geom_point(aes(x = viewCount, y = likeCount))
p2=ggplot(data = videostats_All[-1, ]) + geom_point(aes(x = viewCount, y = dislikeCount))
p3=ggplot(data = videostats_All[-1, ]) + geom_point(aes(x = viewCount, y = commentCount))
grid.arrange(p1, p2, p3, ncol = 2)
## Warning: Removed 1 rows containing missing values (geom_point).
You just cannot help but LOVE Adriana Lima! But let’s move on…
As you can see, the title column holds a title that describes the video. It is logical to assume that the title is the first thing to attract the viewer’s attention. Let’s have a closer look and see if we can identify specific words.
Let’s tokenize the titles, remove stop words and calculate the frequencies of words in the titles. Frequency is calculated as the number of times a particular word is used in a channel’s titles divided by the number of videos in that channel.
title_words_All_Source <- videostats_All %>%
  as_tibble() %>%
  unnest_tokens(word, title) %>%           # one row per word in each title
  anti_join(stop_words) %>%                # drop common stop words
  count(source, word, sort = TRUE) %>%     # word counts per channel
  left_join(videostats_All %>%
              group_by(source) %>%
              summarise(total = n())) %>%  # number of videos per channel
  mutate(freq = n/total)
## Joining, by = "word"
## Joining, by = "source"
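To make each step of that pipeline visible, here is a rough base R equivalent on three titles taken from the data above (the apostrophe is typed as a plain one). The stop-word list stop_list is a tiny made-up substitute for tidytext's stop_words, and the tokenizer is a simple regular expression, so the output will not match unnest_tokens exactly.

```r
# Three titles from the data above, tagged with their channel
titles <- data.frame(
  source = c("Vogue", "Vogue", "VS"),
  title = c("73 Questions With Derek Zoolander",
            "Cleaning House With Organizing Guru Marie Kondo",
            "Romee Strijd on Becoming a Victoria's Secret Angel"),
  stringsAsFactors = FALSE
)

# Tiny illustrative stop-word list; tidytext's stop_words is far longer
stop_list <- c("with", "on", "a", "the")

# Lower-case each title and split on anything that is not a letter, digit or apostrophe
tokens <- strsplit(tolower(titles$title), "[^a-z0-9']+")

# One row per (channel, word)
words <- data.frame(
  source = rep(titles$source, lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)
words <- words[nzchar(words$word) & !words$word %in% stop_list, ]

# Count words per channel, then divide by the number of videos in that channel
counts <- as.data.frame(table(source = words$source, word = words$word),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]
videos_per_source <- table(titles$source)
counts$freq <- counts$Freq / as.vector(videos_per_source[counts$source])
```

Because the denominator is the number of videos, freq reads as "share of a channel's videos whose title uses this word", as long as a word appears at most once per title.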